Abstract

We present a new boosting algorithm, moti- 
vated by the large margins theory for boost- 
ing. We give experimental evidence that the 
new algorithm is significantly more robust 
against label noise than existing boosting
algorithms.

1. Introduction 

Since the invention of Adaboost by Freund and 
Schapire (?; ?) it has become very popular with both 
theoreticians and practitioners of machine learning. 
Many variants of the algorithm have been devised. 

One of the most intriguing properties of Adaboost is 
the fact that it tends not to overfit. In many cases 
the test error of the generated classifier continues to 
decrease even after the training error has decreased to 
zero (?; ?; ?; ?). There are two main theories for
explaining this behaviour. The first is the large margins
theory, proposed by Schapire et al. (?). This theory
is closely related to the theory of support vector
machines (SVM) (?). The focus of large margins theory
is on the task of minimizing the classification error
rate on the test set. The second theory, proposed by
Friedman et al. (?), relates Adaboost to logistic
regression. The main focus of this theory is on maximizing
the likelihood of a conditional probability distribution 
represented as a logistic function. The decrease in clas- 
sification error is seen as a by-product of the increase 
in the likelihood. 

One problem with Adaboost that was recognized
early on is its sensitivity to noise (?). The
performance of Adaboost deteriorates rapidly when random
label noise is added to the training set. Friedman et
al. proposed a variant of Adaboost, which they named
gentle Adaboost or Logitboost, which is significantly
better than Adaboost at tolerating label noise. Algorithms
similar to Logitboost include log-loss Boost, proposed
by Collins, Schapire and Singer (?), and MAdaboost,
proposed by Domingo and Watanabe (?). As these
algorithms are very similar to Logitboost, we will
refer only to Logitboost from now on. While Adaboost
puts unboundedly large weights on mislabeled examples,
the weight placed on any example by Logitboost is 
bounded. This decreases the penalty on mislabeled 
examples and increases the ability of the algorithm to 
tolerate noise. 

All of these algorithms can be described as methods
for minimizing a potential function using gradient
descent (?). Moreover, the potential functions used by
Adaboost, Logitboost, Logloss Boost and MAdaboost
are all convex. The minimum of a convex potential
function can be computed efficiently, which is the reason
boosting is an efficient algorithm. However, Long and
Servedio (?) prove that any boosting algorithm that is 
based on a convex potential function can be defeated 
by random label noise. They present a simple con- 
struction of a distribution that cannot be learned using 
such algorithms. 

In this paper we present a new boosting algorithm, 
which we call Robustboost, which is significantly more 
robust against label noise than either Adaboost or 
Logitboost. The new algorithm is based on Freund's
Boost-by-Majority algorithm (?) and on Brownboost (?).
The algorithm is a potential-based algorithm.
However, the potential function is not convex
and it changes during the boosting process. 

The paper is organized as follows. In Section 2 we
describe the potential based approach for learning linear
discriminators and the problems associated with label
noise and convex potential functions. In Section 3 we
discuss the margin based explanation for Adaboost's
resistance to overfitting and how to apply it to
problems in which the data is not linearly separable. In
Section 4 we present the new boosting algorithm. In
Section 5 we give the experimental evidence that the
new algorithm has superior robustness against label
noise. We conclude in Section 6.

2. Learning Linear Discriminators under noise

To simplify this explanation, we fix the set of base
classifiers and assume that the weak learner picks the base
classifier with the smallest weighted error at each
iteration. In this section we focus on the problem of
minimizing the number of mistakes that the combined
classifier makes on the training set. The performance
of the learned classifier on examples outside the
training set will be discussed in Section 3.

We assume that we are given a training set $S =
((x_1, y_1), \ldots, (x_N, y_N))$ where the $x_i$ are the feature
vectors and the $y_i \in \{-1,+1\}$ are the binary labels. The
classification rule generated by Adaboost is of the
form $c(x) = \mathrm{sign}(\sum_{i=1}^{d} \alpha_i h_i(x))$, where
$\alpha_1, \ldots, \alpha_d$ are real numbers and $h_1, \ldots, h_d$
is the fixed set of base rules. We assume that the range of the
base classifiers is $\{-1,+1\}$. As the base classifiers are fixed, we
can represent the feature vector $x$ by the $d$-dimensional
binary vector $\vec{h} = (h_1(x), \ldots, h_d(x))$. This
reduces the problem to that of learning a weight vector
$\vec{\alpha} = (\alpha_1, \ldots, \alpha_d)$ that defines a good linear
discriminator of the form $\mathrm{sign}(\vec{\alpha} \cdot \vec{h})$.

We call the argument of the signum function the score
function $s(x) = \vec{\alpha} \cdot \vec{h}$. The (un-normalized) margin of
an example $(x, y)$ is defined to be the product of the
score function and the label: $m(x,y) = y\,s(x)$. Clearly
$m(x,y) > 0$ if and only if the classification $c(x)$ is
correct. Thus the indicator function $\mathbf{1}[m(x_i,y_i) < 0]$
is 1 if $c(x_i) \neq y_i$ and 0 otherwise. We call this indicator
function the "error step function".

Our goal in this section is to find the linear discrim- 
inator which minimizes the training error, which can 
be expressed as 

$$P_S[c(x) \neq y] = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[m(x_i, y_i) < 0] \qquad (1)$$
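
As an illustration of Equation (1), the following minimal Python sketch computes scores, un-normalized margins and the training error for a small hand-made example. The function names and the toy data are hypothetical and serve only to make the definitions concrete.

```python
from typing import Sequence

def score(alpha: Sequence[float], h: Sequence[int]) -> float:
    """Score s(x) = alpha . h, where h holds base-classifier outputs in {-1, +1}."""
    return sum(a * hi for a, hi in zip(alpha, h))

def margin(alpha: Sequence[float], h: Sequence[int], y: int) -> float:
    """Un-normalized margin m(x, y) = y * s(x)."""
    return y * score(alpha, h)

def training_error(alpha, H, Y) -> float:
    """Fraction of training examples whose margin is negative (Equation 1)."""
    mistakes = sum(1 for h, y in zip(H, Y) if margin(alpha, h, y) < 0)
    return mistakes / len(Y)

# Toy data (illustrative only): three base classifiers, four examples.
alpha = [1.0, 1.0, 1.0]
H = [[1, 1, -1], [-1, -1, 1], [1, -1, -1], [1, 1, 1]]
Y = [1, -1, 1, -1]
print(training_error(alpha, H, Y))
```

The last two toy examples have negative margins, so half of this training set is misclassified.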

If there exists a linear classifier whose training error is 
zero we say that the training data is linearly separable. 
In this case there are several provably efficient algo- 
rithms for finding the separating hyperplane (for ex- 
ample, the perceptron algorithm). On the other hand, 
when the training data is not linearly separable there 
is no known efficient algorithm for finding the vector $\vec{\alpha}$
that minimizes the training error. In fact, it is known
that this problem is NP-hard: it is NP-hard
to find the linear classifier with the minimal number
of disagreements, and even to approximate that number
within a factor of $2 - \epsilon$ (?).

As finding good linear separators is a problem of great
practical importance, algorithms have been developed
that can find the optimal separating hyperplane under
particular assumptions regarding the distribution
of the examples. Two prominent examples are linear
discriminant analysis, which assumes that the two
classes are normally distributed with equal covariance
matrices, and logistic regression, which assumes that
the conditional probability of the label given the
input is described by the logistic function. Friedman et
al. (?) show that logistic regression is closely related
to Adaboost and to Logitboost.

Figure 1. Potential and weight functions for Adaboost and
Logitboost.

Logitboost and Adaboost can be represented using a
potential function. The potential function $\Phi$ is a
decreasing function of the margin $m$ which upper bounds
the error step function: $\Phi(m) \geq \mathbf{1}[m < 0]$. As the
average potential is an upper bound on the classification
error, decreasing it is a good heuristic for decreasing
the classification error. The potential function for
Adaboost is $\exp(-m)$ and the potential function for
Logitboost is $\ln(1 + \exp(-m))$. These functions are
very close when $m \gg 0$ but diverge when $m$ is negative,
because for $m \ll 0$ the potential for Logitboost is
approximately linear: $\ln(1 + \exp(-m)) \approx -m$.
Minimizing these potential functions can be done very
effectively using gradient descent methods. In particular,
using the chain rule we get a simple expression for the
derivative of the average potential w.r.t. $\alpha_i$. Denoting
$m(x_j,y_j)$ by $m_j$ we get

$$\frac{\partial}{\partial \alpha_i} \frac{1}{N} \sum_{j=1}^{N} \Phi(m_j) = \frac{1}{N} \sum_{j=1}^{N} \left.\frac{d\Phi(m)}{dm}\right|_{m=m_j} y_j h_i(x_j)$$

It is therefore natural to define a weight function $w(m)$
that is (minus) the derivative of the potential function
with respect to $m$. Using this notation we get the
expression

$$\frac{\partial}{\partial \alpha_i} \frac{1}{N} \sum_{j=1}^{N} \Phi(m_j) = -\frac{1}{N} \sum_{j=1}^{N} w(m_j)\, y_j h_i(x_j)$$

which has the attractive interpretation that the derivative
of the average potential w.r.t. $\alpha_i$ is equal, up to sign, to the
correlation between $h_i(x)$ and $y$ over the training examples,
weighted by $w(m(x, y))$. The weights represent the relative
importance of different examples in reducing the
average potential. The weight function for Adaboost
is $w(m) = \exp(-m)$ and the weight function for Logitboost
is $w(m) = 1/(1+\exp(m))$. Note that the weights
assigned by Adaboost increase exponentially without bound
as $m$ decreases below 0, while the weights assigned by
Logitboost are at most 1.
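
The potential and weight functions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the gradient routine simply evaluates the weighted-correlation expression (with its minus sign) for one coordinate of the weight vector.

```python
import math

def adaboost_potential(m: float) -> float:
    return math.exp(-m)

def adaboost_weight(m: float) -> float:
    # minus the derivative of exp(-m)
    return math.exp(-m)

def logitboost_potential(m: float) -> float:
    return math.log(1.0 + math.exp(-m))

def logitboost_weight(m: float) -> float:
    # minus the derivative of ln(1 + exp(-m)); always below 1
    return 1.0 / (1.0 + math.exp(m))

def average_potential(potential, alpha, H, Y) -> float:
    """Average potential over the training set for weight vector alpha."""
    margins = [y * sum(a * hi for a, hi in zip(alpha, h)) for h, y in zip(H, Y)]
    return sum(potential(m) for m in margins) / len(Y)

def potential_gradient(weight, alpha, H, Y, i: int) -> float:
    """d/d alpha_i of the average potential: -(1/N) sum_j w(m_j) y_j h_i(x_j)."""
    total = 0.0
    for h, y in zip(H, Y):
        m = y * sum(a * hj for a, hj in zip(alpha, h))
        total += weight(m) * y * h[i]
    return -total / len(Y)

# On a badly misclassified example the two weights differ by orders of magnitude.
print(adaboost_weight(-10.0), logitboost_weight(-10.0))
```

Comparing `potential_gradient` against a finite-difference estimate of `average_potential` is a quick way to check the sign convention.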

Consider now what happens when we apply Adaboost
or Logitboost to a linearly separable dataset to which
we added independent label noise. In other words, we
take a training set $S$ which is perfectly classified by
the hyperplane defined by $\vec{\alpha}$ and flip the label of
each example independently at random with probability
$q < 1/2$. The training set is no longer linearly separable.
However, the classifier $\vec{\alpha}$ which previously separated
the two classes is still almost optimal, as its error
rate is approximately $q$, which is the lowest achievable
error rate. On the other hand, $\vec{\alpha}$ is unlikely to be the
point at which the average potential of Adaboost or
Logitboost achieves its minimum. Long and Servedio (?)
give a rigorous proof of this fact. Here we give
a short intuitive explanation.

If there are examples in the clean dataset whose margin
$m$ with respect to $\vec{\alpha}$ is large, then a fraction of
about $q$ of these examples now have a margin $m$ that
is a large negative number. If the margin of such a
noisy example is $m \ll 0$ then the potential assigned to
it by Adaboost is $\exp(-m)$ and the potential assigned
to it by Logitboost is about $-m$. Both potentials are
larger than the error step function which they bound
(see Figure 1). Moving away from $\vec{\alpha}$ can decrease the
distance between the misclassified examples and the
decision hyperplane, i.e. increase the corresponding
negative margins, thereby decreasing the potential of
those examples and therefore the overall average potential.
The potentials and the weights assigned to mislabeled
examples by Adaboost are much larger than
those assigned by Logitboost. Therefore Logitboost is
much more robust against label noise than Adaboost.
However, Long and Servedio show that random label
noise is a problem for any boosting algorithm that uses
a convex potential function. The algorithm we propose
in this paper is based on a non-convex potential function.
It is thus even more robust than Logitboost against
label noise. In Section 5.1 we give experimental
evidence that our algorithm is much more robust than
Logitboost and Adaboost for the classification problem
suggested by Long and Servedio.

3. The large margins theory 

In the previous section we focused on minimizing the
training error. However, it is clear that Adaboost and
Logitboost are doing something more than minimizing
training error. In many experiments (?; ?; ?; ?) the
test error of the generated strong classifier continues to
decrease for many boosting iterations after the training
error becomes zero. Characterizing the criterion that
Adaboost optimizes, and which predicts the test error
better than the training error does, is an important step
towards finding a better boosting algorithm. To this
end we use the large margins theory of Schapire et
al. (?).

We use the following terminology. Recall that the
un-normalized margin is defined as $m(x,y) =
y \sum_i \alpha_i h_i(x)$. We define the normalized margin
to be $\bar{m}(x, y) = \left(y \sum_i \alpha_i h_i(x)\right) / \left(\sum_i |\alpha_i|\right)$.

The generalization error of the classifier $c(x)$ is the
probability that $c(x) \neq y$ when $(x, y)$ is generated
by the underlying distribution $D$. Using the margin
notation we express the generalization error as
$P_D[m(x) < 0]$. We denote the optimal classification
error by $c^*$.

The margin theory posits that large positive margins
on training examples are predictive of small generalization
error. Specifically, Theorem 1 in (?) states
that for any $\theta > 0$, with probability $1 - \delta$ over the
random choice of the training set

$$P_D[\bar{m}(x) < 0] \leq P_S[\bar{m}(x) < \theta] + O\left(\frac{1}{\sqrt{N}} \sqrt{\frac{\log N \log d}{\theta^2} + \log(1/\delta)}\right) \qquad (2)$$

where $P_D[\bar{m}(x) < 0]$ is the generalization error, i.e. the
probability of making a mistake with respect to the
underlying distribution, $P_S[\bar{m}(x) < \theta]$ is the fraction
of the examples in the training set for which the margin
is smaller than $\theta$, $d$ is the number of base rules and $N$
is the size of the training set.

The interpretation of this theorem is that to minimize
the generalization error we should find a linear classifier
that minimizes the number of training examples for
which $\bar{m}(x) < \theta$ for a large value of $\theta$. Note that
varying $\theta$ has opposite effects on the two terms in the
bound. Increasing $\theta$ causes the first term to increase
to 1, while decreasing $\theta$ towards 0 causes the second
term to blow up.

Schapire et al. show that Adaboost will tend to decrease
the bound given in Equation (2) by increasing
the value of $\theta$ for which the term $P_S[\bar{m}(x) < \theta]$ is equal
to zero, in other words, by maximizing the minimal
margin. Breiman (?) and Grove and Schuurmans (?)
give experimental evidence against this explanation for
why Adaboost does not overfit. They maximized the
minimal margin directly and showed that this does not
tend to decrease the generalization error.

However, setting the first term in Equation (2) to zero
and minimizing the second term is often not the best
way to minimize the bound. If the training set is
not linearly separable it is impossible to set the term
$P_S[\bar{m}(x) < \theta]$ to zero for any $\theta > 0$. We can still get
meaningful bounds from Equation (2), but in order to
do that we need an algorithm for finding a weight vector
$\vec{\alpha}$ for which $\bar{m}(x_i,y_i) > \theta$ for most but not all of
the training examples.

We therefore redefine the goal of the boosting algorithm.
Instead of minimizing the number of mistakes
on the training set, we define the goal as minimizing
the number of examples whose normalized margin is
smaller than some value $\theta > 0$. Stated using our
notation, our goal is to minimize

$$\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\bar{m}(x_i, y_i) < \theta]. \qquad (3)$$
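
To make the target in Equation (3) concrete, here is a minimal Python sketch that computes normalized margins (the score times the label, divided by the L1 norm of the weight vector) and the fraction of examples whose normalized margin falls below a goal margin. The function names and the toy data are hypothetical.

```python
def normalized_margin(alpha, h, y) -> float:
    """y * sum_i alpha_i h_i(x), divided by sum_i |alpha_i|."""
    l1_norm = sum(abs(a) for a in alpha)
    return y * sum(a * hi for a, hi in zip(alpha, h)) / l1_norm

def margin_error(alpha, H, Y, theta: float) -> float:
    """Fraction of examples with normalized margin below theta (Equation 3)."""
    below = sum(1 for h, y in zip(H, Y) if normalized_margin(alpha, h, y) < theta)
    return below / len(Y)

# Toy data (illustrative only): normalized margins are 1.0, 0.0, 0.0 and 0.5.
alpha = [2.0, 1.0, 1.0]
H = [[1, 1, 1], [1, -1, -1], [-1, 1, 1], [1, 1, -1]]
Y = [1, 1, 1, 1]
for theta in (0.0, 0.25, 0.75):
    print(theta, margin_error(alpha, H, Y, theta))
```

With `theta = 0` no example in this toy set counts as an error, while larger goal margins also penalize the examples that sit on or near the decision boundary.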

In the next section we describe the potential-based 
boosting algorithm for minimizing this target function. 

4. Robustboost 

Our proposed algorithm, which we call Robustboost,
is a variation on the Brownboost algorithm proposed by
Freund (?) which, in turn, is based on Freund's
Boost-by-Majority (BBM) algorithm (?). We give a brief
description of Boost-by-Majority and Brownboost and
then describe Robustboost.

BBM is based on the idea of a finite horizon. The number
of boosting iterations is set in advance based on an
error goal parameter $\epsilon$ which is given to BBM as input.
While all boosting algorithms give small weight to
examples with large positive margins, BBM also gives small
weight to examples with large negative margins on the
later iterations. Intuitively, it "gives up" on examples
which are so far on the incorrect side of the boundary
that they are unlikely to be classified correctly at
the end. These examples become part of the training
error, which is quantified by $\epsilon$. As the weight for a
given margin is the derivative of the potential, this
implies that the potential has decreasing slope for large
positive margins as well as large negative margins. In
other words, the potential function is non-convex.

One deficiency of BBM is that it is not adaptive. In
other words, the base classifier added at each iteration
is assigned a weight of one regardless of its accuracy (a
base classifier is assigned a larger weight only if it is added
in multiple iterations). Brownboost is the adaptive
version of BBM. It assigns base classifiers with small
error larger weight than base classifiers with error close
to 1/2. Brownboost uses a real valued variable called
"time" and denoted by $t$. Before the first iteration
$t = 0$, and it is increased in each iteration. While
BBM terminates after a pre-defined number of iterations,
Brownboost terminates when $t$ reaches a pre-defined
value $c$. The parameter $c$ defines the horizon of
the boosting process and is pre-computed according to
the target error $\epsilon$. As $\epsilon \to 0$, $c \to \infty$ and Brownboost
becomes equivalent to Adaboost. In other words, Adaboost
is a special case of Brownboost where the target
error is zero.

The algorithm we propose here is very similar to
Brownboost. The main difference is that instead of
minimizing the training error its goal is to minimize
the margin-based cost function defined in Equation (3).
In order to minimize this cost function it
needs to use a normalized weight vector. Designing an
algorithm that would keep the $L_1$ norm of the weight
vector bounded proved difficult. Our solution is to
normalize the weight vector so that the variance of
the scores is bounded. In other words, the algorithm
controls the weight vector $\vec{\alpha}$ so that $\mathrm{Var}(\vec{\alpha} \cdot \vec{h})$ is small.
This is achieved by adding to the drift in the underlying
Brownian motion process a component which
pushes the examples towards zero. This makes the
underlying process equivalent to the mean-reverting
Ornstein-Uhlenbeck process (see page 75 in (?)).

We now describe the details of Robustboost. In our
setup the range of the time variable is $0 \leq t \leq 1$.
Denoting the margin by $m$ we define the potential function
to be

$$\Phi(m, t) = 1 - \mathrm{erf}\left(\frac{m - \mu(t)}{\sigma(t)}\right). \qquad (4)$$
where erf is the error function

$$\mathrm{erf}(a) = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{a} e^{-x^2}\, dx$$

$\mu(t)$ and $\sigma(t)$ are defined by the equations

$$\sigma^2(t) = (\sigma_f^2 + 1)\, e^{2(1-t)} - 1 \qquad (5)$$

and

$$\mu(t) = (\theta - 2\rho)\, e^{1-t} + 2\rho \qquad (6)$$

Figure 2. The potential functions used by Robustboost as
a function of $t$. The potential function at $t = 1$ is a close
approximation of the margin error function $\mathbf{1}[\bar{m}(x_i, y_i) < \theta]$.

where $\theta$, $\sigma_f$ and $\rho$ are parameters of the algorithm that
we will describe shortly. Taking the partial derivative
of $\Phi(m,t)$ with respect to $m$ we get that the weight
function is

$$w(m, t) = \exp\left(-\frac{(m - \mu(t))^2}{2\sigma(t)^2}\right) \qquad (7)$$
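
The time-dependent potential and weight functions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, and "erf" in Equation (4) is assumed to be a CDF-like error function with values in [0, 1] (so that the potential at $t = 1$ approximates a step function), which in terms of Python's standard library means $1 - \mathrm{erf}(z)$ is computed as `0.5 * math.erfc(z)`.

```python
import math

def sigma(t: float, sigma_f: float) -> float:
    """sigma(t) from Equation (5): sigma^2 = (sigma_f^2 + 1) e^{2(1-t)} - 1."""
    return math.sqrt((sigma_f ** 2 + 1.0) * math.exp(2.0 * (1.0 - t)) - 1.0)

def mu(t: float, theta: float, rho: float) -> float:
    """mu(t) from Equation (6): (theta - 2 rho) e^{1-t} + 2 rho."""
    return (theta - 2.0 * rho) * math.exp(1.0 - t) + 2.0 * rho

def potential(m, t, theta, rho, sigma_f) -> float:
    """Equation (4), ASSUMING a CDF-like erf with range [0, 1]."""
    z = (m - mu(t, theta, rho)) / sigma(t, sigma_f)
    return 0.5 * math.erfc(z)

def weight(m, t, theta, rho, sigma_f) -> float:
    """Equation (7): a Gaussian bump centred at mu(t) with width sigma(t)."""
    d = m - mu(t, theta, rho)
    return math.exp(-d * d / (2.0 * sigma(t, sigma_f) ** 2))

# Illustrative parameter values only.
theta, rho, sigma_f = 0.2, 0.3, 0.1
for m in (-10.0, 0.0, 10.0):
    print(m, weight(m, 0.5, theta, rho, sigma_f))
```

The weight is small for large positive and for large negative margins, which is the non-convex behaviour described above. At $t = 1$, $\mu$ reduces to $\theta$ and $\sigma$ to $\sigma_f$.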



The parameter $\theta \geq 0$ is the goal margin, as defined in
Equation (3). It is set using cross-validation. Increasing
$\theta$ decreases the difference between the performance
on the training set and on the validation set. The
parameter $\sigma_f > 0$ defines the slope of the step in the final
potential function (see Figure 2). We set $\sigma_f = 0.1$ to
avoid numerical instability when $t$ is close to 1.

Like Brownboost, Robustboost is a self-terminating
algorithm. It terminates when $t \geq 1$. If the error goal $\epsilon$
is set too small then Robustboost will not terminate.
Setting the right value for $\epsilon$ is done by searching for the
minimal value of $\epsilon$ for which the algorithm terminates
within a reasonable number of iterations. Setting $\epsilon$
determines the value of the parameter $\rho$. That value
is the solution to the following equation

$$\epsilon = \Phi(0, 0) = 1 - \mathrm{erf}\left(\frac{2(e - 1)\rho - e\theta}{\sqrt{e^2(\sigma_f^2 + 1) - 1}}\right) \qquad (8)$$
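
One way to solve Equation (8) for $\rho$ numerically is bisection, since the initial potential decreases monotonically as $\rho$ grows. The sketch below is only an illustration: the function names and the search interval are hypothetical, and, as an assumption, "erf" is again taken to be a CDF-like error function with range [0, 1], so that $1 - \mathrm{erf}(z)$ equals `0.5 * math.erfc(z)`.

```python
import math

def initial_potential(rho: float, theta: float, sigma_f: float) -> float:
    """Phi(0, 0) as a function of rho (right-hand side of Equation 8).

    ASSUMPTION: erf is CDF-like with range [0, 1], hence 0.5 * erfc(z).
    """
    e = math.e
    z = (2.0 * (e - 1.0) * rho - e * theta) / math.sqrt(e * e * (sigma_f ** 2 + 1.0) - 1.0)
    return 0.5 * math.erfc(z)

def solve_rho(epsilon: float, theta: float, sigma_f: float,
              lo: float = -50.0, hi: float = 50.0, iters: int = 200) -> float:
    """Bisection on rho; the search interval [lo, hi] is an arbitrary choice."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if initial_potential(mid, theta, sigma_f) > epsilon:
            lo = mid  # potential still too large: rho must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Parameter values taken from the Long/Servedio experiment with theta = 0.
rho = solve_rho(epsilon=0.14, theta=0.0, sigma_f=0.1)
print(rho, initial_potential(rho, 0.0, 0.1))
```

Under a different reading of the erf normalization the numerical value of $\rho$ changes, but the same bisection approach applies.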



Robustboost

parameters: $\epsilon > 0$, $\theta \geq 0$, $\sigma_f > 0$
training set: $(x_1,y_1), \ldots, (x_N,y_N)$ where $x_j \in X$, $y_j \in \{-1,+1\}$

set $\rho$ to satisfy Equation (8).
initialize $k = 1$, $t_1 := 0$, $H_0 \equiv 0$, $m(1) := 0, \ldots, m(N) := 0$

repeat

    Define the distribution over the $N$ training examples by
    normalizing $w(m, t)$ defined in Equations (7,5,6):

        $$\frac{w(m(j), t_k)}{\sum_{i=1}^{N} w(m(i), t_k)}$$

    call base learner and get $h_k : X \to \{-1,+1\}$
    which is slightly correlated with the label:

        $$E[y_j h_k(x_j)] > 0$$

    find $\Delta m_k > 0$, $1 - t_k \geq \Delta t_k > 0$ that simultaneously
    satisfy the following two equations:

        $$\sum_{j=1}^{N} y_j h_k(x_j)\, w(m'(j), t_k + \Delta t_k) = 0 \quad \text{or} \quad \Delta t_k = 1 - t_k$$

        $$\sum_{j=1}^{N} \Phi(m(j), t_k) = \sum_{j=1}^{N} \Phi(m'(j), t_k + \Delta t_k)$$

    where

        $$m'(j) = m(j)\, e^{-\Delta t_k} + y_j h_k(x_j)\, \Delta m_k$$

    and $\Phi(\cdot,\cdot)$, $w(\cdot,\cdot)$ are defined by Equations (4,7,5,6).

    update: $t_{k+1} := t_k + \Delta t_k$
            $\forall\, 1 \leq j \leq N$: $m(j) := m'(j)$
            $H_k = H_{k-1}\, e^{-\Delta t_k} + \Delta m_k h_k$
            $k := k + 1$

until $t_k \geq 1$.

Output: the final hypothesis $H_k$.

5. Experiments 

We report the results of two sets of experiments using 
synthetic data distributions. The first distribution is 
taken from Long and Servedio (?). The second is taken 
from Mease and Wyner (?). 

5.1. The Long/Servedio problem 

Long and Servedio suggested the following challenging
classification problem. The input is a binary feature
vector of length 21: $(x_1, \ldots, x_{21})$ where $x_i \in \{-1,+1\}$
and the output is $y \in \{-1,+1\}$. A random example
is generated as follows. First, the label $y$ is chosen
with equal odds to be $-1$ or $+1$. Given $y$ the features
$x_i$ are generated according to the following mixture
distribution:

• Large margin examples: With probability 1/4
all of the $x_i$ are set equal to the label $y$.

• Pullers: With probability 1/4 we set the first
eleven coordinates equal to the label, $x_1 = x_2 =
\cdots = x_{11} = y$, and $x_{12} = x_{13} = \cdots = x_{21} = -y$.

• Penalizers: With probability 1/2 we do the following.
Choose at random 5 coordinates out of
the first 11 and 6 coordinates out of the last 10
and set those equal to the label $y$. Set the remaining
10 coordinates to $-y$.

It is easy to check that the majority vote rule
$\mathrm{sign}(\sum_i x_i)$ is a perfect classifier for this data. To learn
this classifier using boosting we define the base
classifiers to be single coordinates, i.e. $h_i(x) = x_i$. Clearly,
using these base classifiers the target classifier can be
represented exactly. As this is a linearly separable
distribution, both Adaboost and Logitboost can learn it
perfectly.

However, if we add label noise to this problem, i.e. if 
we flip each label with probability 0.1 then the per- 
formance of both Adaboost and Logitboost degrades 
severely. The optimal rule remains the majority vote 
rule, but the average potential of this rule is very large 
and the minimal potential is achieved by a sub-optimal 
classifier. Long and Servedio show that any potential 
based boosting algorithm that uses a convex poten- 
tial function will fail to find a classifier whose training 
error is close to that of the majority vote rule. It is
important to note that the failure of these learning
algorithms already shows on the training
set and is a problem of underfitting, not overfitting,
the training data.
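
The mixture distribution and the label noise described above can be sketched as a small Python generator. This is a minimal illustration using only the standard library; the function names, the seed handling and the tuple layout `(x, clean_label, noisy_label)` are hypothetical choices, not part of the original experimental code.

```python
import random

def long_servedio_example(rng: random.Random):
    """Draw one clean example (x, y) with x in {-1,+1}^21."""
    y = rng.choice([-1, 1])
    u = rng.random()
    if u < 0.25:
        # Large margin example: every coordinate agrees with the label.
        x = [y] * 21
    elif u < 0.5:
        # Puller: first 11 coordinates agree, last 10 disagree.
        x = [y] * 11 + [-y] * 10
    else:
        # Penalizer: 5 of the first 11 and 6 of the last 10 agree with y.
        x = [-y] * 21
        agree = rng.sample(range(11), 5) + rng.sample(range(11, 21), 6)
        for i in agree:
            x[i] = y
    return x, y

def make_dataset(n: int, noise: float, seed: int = 0):
    """Return a list of (x, clean_label, noisy_label) triples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, y = long_servedio_example(rng)
        noisy = -y if rng.random() < noise else y  # independent label flip
        data.append((x, y, noisy))
    return data

def majority_vote(x) -> int:
    return 1 if sum(x) > 0 else -1

data = make_dataset(800, noise=0.1, seed=1)
clean_errors = sum(1 for x, y, _ in data if majority_vote(x) != y)
print(clean_errors)
```

Every example type has exactly 11 or 21 coordinates equal to the clean label, so the majority vote makes no mistakes with respect to the clean labels, and its error with respect to the noisy labels is just the fraction of flipped labels.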

We compare the performance of Robustboost, Logit- 
boost and Adaboost on this problem. We run each 
boosting algorithm for 300 iterations. We generated 
10 datasets, each consisting of 800 training examples. 
We run each algorithm on each of the data sets and 
compute the training error for each case. We also com- 
pute the error with respect to the clean labels, which 
have not been corrupted by noise. We report the av- 
erage and standard deviation of these errors and their 
relative order for individual datasets. 

We set the parameters of Robustboost to be $\sigma_f = 0.1$
and $\epsilon = 0.14$. The value of $\epsilon$ was chosen so
that Robustboost terminates after 200-300 iterations
for most datasets. We tried two settings of $\theta$. In one setting
$\theta = 0$, i.e. the goal of the algorithm is to minimize
the training error. In the other setting $\theta = 0.2$, which
means that the goal of the algorithm is to minimize
the number of training examples whose margin is less
than 0.2. The numbers in this and the following tables
are percent errors (i.e. they are numbers
between 0 and 100).

Figure 3. The score distribution and the potential functions
for Logitboost on a Long/Servedio training set on
iteration 100.

Robust
Err
13.2 ± 1.3
10.0 ± 1.3
24.3 ± 1.7
22.6 ± 1.7
5.5 ± 2.5
0.9 ± 1.0



The relative order of the errors was always the same. 
The error of Robustboost is far smaller than that of 
Logitboost, which is slightly better than that of Adaboost.
Surprisingly, the error of Robustboost is further
improved when we set $\theta = 0.2$. The difference is even
more pronounced when comparing the predictions of 
the classifier to the noiseless labels. In particular, the 
error of Robustboost is about 1.0% even though the 
data on which it was trained has 10% error with re- 
spect to the clean data. In other words, Robustboost 
is able to detect and correct most of the mislabeled 
examples. 

We can gain insight into the reasons that Robustboost
succeeds while Adaboost and Logitboost fail by comparing
the evolution of the score distributions for the
different algorithms. In Figure 3 we show the score
distribution to which Logitboost converges after 100
iterations. This distribution changes very little from
iteration 100 to iteration 300. What we see is that the
algorithm converged to a minimal potential vector $\vec{\alpha}$
in which the large margin examples and the pullers
are well separated, but the penalizers are distributed
more or less randomly. The reason is that the mislabeled
large margin examples and pullers have relatively large
weights (the derivative of the potential is close to one)
while the weight of each individual penalizer is small.
As the penalizers are sparse, they cannot "pull" $\vec{\alpha}$ away from
the direction suggested by the pullers and large margin
examples, and so about half of them are misclassified,
contributing about 25% to the training error.

Contrast this with the score distributions shown in
Figure 4. After 100 boosting iterations the potential
is such that the weight of large margin examples is
close to zero whether or not they are mislabeled and the
weight of mislabeled pullers is smaller than it was with
Logitboost. This means that the algorithm ignores
the large margin examples and concentrates on the
pullers and the penalizers, without giving the pullers
too much weight. The result is that after 200 iterations
many of the penalizers are classified correctly and the
pullers are mixed in with the penalizers. Note that for
the ideal solution the margin of pullers and penalizers
that are not mislabeled is equal to $11/21 - 1/2 = 1/42$.
Our main point is that Robustboost avoids the incorrect
minima that trap Adaboost and Logitboost by
ignoring examples with large negative margins.

5.2. The Mease/Wyner problem

In this section we report results of experimental com- 
parisons using synthetic distributions analyzed by 
Mease and Wyner (?). In this case a majority vote
over the base classifiers can only approximate the target
classifier, which significantly complicates the problem.

The input to this classification problem is a $d = 20$
dimensional vector $x$ where each coordinate of $x$ is chosen
IID according to the uniform distribution over the
segment $[0, 1]$. The label $y$ is $+1$ if $\sum_{i=1}^{5} x_i > 2.5$ and $-1$
otherwise. The base classifiers we tested are decision
stumps, i.e. rules of the form $\mathrm{sign}(x_i - a)$, or a 2-level
decision tree made out of these decision rules. Unlike
the Long and Servedio distribution, a finite number
of these base classifiers cannot exactly represent the
target rule. Mease and Wyner use this distribution to
compare the effects of random label noise on Adaboost
and Logitboost. We add Robustboost to the comparison.
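
A data generator and a decision stump for this problem can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the number `J` of coordinates that enter the label sum is an assumption (set to 5 here so that the threshold 2.5 equals the mean of the sum); it should be adjusted if the intended distribution differs.

```python
import random

def mease_wyner_example(rng: random.Random, d: int = 20, J: int = 5, q: float = 0.0):
    """Return (x, clean_label, noisy_label) with x uniform on [0, 1]^d.

    ASSUMPTION: the clean label depends on the first J coordinates only.
    """
    x = [rng.random() for _ in range(d)]
    clean = 1 if sum(x[:J]) > J / 2 else -1
    noisy = -clean if rng.random() < q else clean  # flip with probability q
    return x, clean, noisy

def stump(i: int, a: float):
    """Decision stump sign(x_i - a), returning +1 or -1."""
    return lambda x: 1 if x[i] - a > 0 else -1

rng = random.Random(0)
sample = [mease_wyner_example(rng, q=0.1) for _ in range(2000)]
h = stump(0, 0.5)
accuracy = sum(1 for x, clean, _ in sample if h(x) == clean) / len(sample)
print(accuracy)
```

A single stump on a relevant coordinate is only weakly correlated with the clean label, which illustrates why a finite vote over stumps can approximate the target rule but not represent it exactly.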

In each experiment we use 2000 training examples and
2000 test examples (we tried 200 examples but the
between-experiment variation was too large to draw
significant conclusions). We repeat each experiment
15 times and report the mean and standard deviation
of the error on the test set. We tried two levels of
random noise, $q = 0.1$ and $q = 0.2$. The boosting
algorithms are run for (at most) 500 iterations. For
Robustboost we use $\theta = 1.0$ and $\sigma_f = 0.1$, with
$\epsilon = 0.15$ for $q = 0.1$ and $\epsilon = 0.25$ for $q = 0.2$.
For these settings of $\epsilon$
Robustboost terminates after 100-300 iterations.

Figure 4. The score distribution and the potential functions
for Robustboost on a Long/Servedio training set on
iterations 100 (top) and 200 (bottom).



base    q (%)   Ada          Logit        Robust
stump   10      19.3 ± 1.0   15.9 ± 0.9   13.5 ± 0.8
2tree   10      21.4 ± 1.2   18.7 ± 1.1   14.8 ± 1.0
stump   20      29.4 ± 1.2   26.7 ± 1.3   23.8 ± 1.1
2tree   20      32.3 ± 1.0   29.3 ± 0.8   25.3 ± 0.9



As in the previous section Robustboost performs sig- 
nificantly better than Logitboost which is better than 
Adaboost. The relative performance of the algorithms 
is consistent with this table in all 15 repeats of the 
experiment. Using 2 level decision trees is consistently 
worse than using stumps. While significant, the difference
between Robustboost and Logitboost is smaller
than it was in the previous section. We conjecture that
this is because the distribution here is much more
symmetric, which decreases the biasing effect of the
examples with large negative margins.

Continuing only with stumps, we report the error of
the generated classifiers relative to the uncorrupted
labels. The relative performance here is the same, with
Robustboost leading the way.



q (%)   Ada          Logit        Robust
10      11.5 ± 1.1   7.1 ± 0.7    4.3 ± 0.4
20      15.6 ± 1.2   11.2 ± 1.0   6.5 ± 1.0



A potentially more important aspect of the classifier
generated by Robustboost is that its predictions that
are given with large margins are very trustworthy. In
the following table we report the fraction of the test
set on which the absolute value of the score is smaller
than $\theta$ (low margin examples) and the error rate on
the remaining examples relative to the uncorrupted
test data.



q (%)   low margin   clean err
10      10.5 ± 0.6   0.9 ± 0.2
20      10.2 ± 0.7   2.4 ± 0.6



Once more, we see that Robustboost is capable of de- 
tecting most of the incorrect labels for the examples 
with large margins. 
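
The two quantities reported in the table above can be computed with a few lines of Python. This is a minimal sketch with hypothetical names and toy data: it treats examples whose absolute score is below the goal margin as "low margin" and measures the error, relative to the clean labels, on the remaining examples.

```python
def abstention_report(scores, clean_labels, theta: float):
    """Return (fraction of low-margin examples, clean error on the rest)."""
    is_low = [abs(s) < theta for s in scores]
    low_fraction = sum(is_low) / len(scores)
    confident = [(s, y) for s, y, low in zip(scores, clean_labels, is_low) if not low]
    errors = sum(1 for s, y in confident if (1 if s > 0 else -1) != y)
    clean_error = errors / len(confident) if confident else 0.0
    return low_fraction, clean_error

# Toy scores and clean labels (illustrative only).
scores = [2.0, -1.5, 0.3, -0.2, 1.2, -3.0, 0.9, 1.1]
labels = [1, -1, 1, 1, -1, -1, 1, 1]
print(abstention_report(scores, labels, theta=1.0))
```

In this toy example three of the eight scores fall inside the low-margin band, and one of the five confident predictions disagrees with its clean label.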

6. Conclusions

We present evidence that Robustboost is more robust 
against label noise than either Logitboost or Ada- 
boost. More experiments using synthetic and real- 
world datasets are needed to verify this claim. 

The effectiveness of Robustboost suggests that after an
approximate classifier has been learned it can be ben- 
eficial to down-weight examples that are far from the 
decision boundary regardless of their label. This sug- 
gests new directions for active learning which we are 
currently investigating. 